SAS-Abousalh-MC1

VAST 2012 Challenge
Mini-Challenge 1: Bank of Money Enterprise: Cyber Situation Awareness

 

 

Team Members:

 

Nascif Abousalh-Neto, SAS Institute, nascif.abousalhneto@sas.com     PRIMARY
Sumeyye Kazgan, SAS Institute, sumeyye.kazgan@sas.com


Student Team: NO

 

Tool(s):

 

SAS Visual Analytics Explorer

 

Video:

 

SAS_Abousalh_MC1.wmv

 

 

 

Answers to Mini-Challenge 1 Questions:

 

MC 1.1  Create a visualization of the health and policy status of the entire Bank of Money enterprise as of 2 pm BMT (BankWorld Mean Time) on February 2. What areas of concern do you observe? 

 

 

The overall situation across the Bank of Money enterprise as of 2 pm (BMT) on February 2 is good, with a few isolated areas of concern. The heat map below shows a snapshot of the average activity for the entire network, based on the geographical distribution of its machines.

 

Activity (Average)

 

The light green, yellow and red colored areas - corresponding to activity flags 3 (invalid login attempts), 4 (full CPU usage) and 5 (device added) - are concentrated in areas with the most human activity, based on the business hours of their respective time zones. The green areas, corresponding to a preponderance of activity values 1 (normal) and 2 (scheduled for maintenance), indicate machines that either operate without human interaction (servers) or that are outside of business hours (workstations left turned on, despite the company's business rules). This is as expected.

The policy status is mostly normal across the network, with two notable exceptions: two entire areas (later determined to correspond to regions 10 and 5) report a baseline policy status of "Moderate". This was confirmed by selecting minimum as the aggregation for the color response in the heat map below.

 

Policy Status (Minimum)
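For readers without access to SAS Visual Analytics Explorer, the snapshot aggregations behind these two heat maps can be approximated with a short script. The sketch below uses Python/pandas; the file name and column names (healthtime, latitude, longitude, activityflag, policystatus) are assumed placeholders for the log schema, not the exact fields used in our interactive analysis.

    import pandas as pd

    # Column and file names below are assumed placeholders, not the actual log schema.
    logs = pd.read_csv("bom_health_logs.csv", parse_dates=["healthtime"])

    # Snapshot: status reports at 2 pm BMT on February 2.
    snapshot = logs[logs["healthtime"] == pd.Timestamp("2012-02-02 14:00")]

    # Heat map inputs: one row per location, with the chosen aggregation
    # as the color response (average activity, minimum policy status).
    by_location = snapshot.groupby(["latitude", "longitude"]).agg(
        avg_activity=("activityflag", "mean"),
        min_policy=("policystatus", "min"),
    )
    print(by_location.sort_values("avg_activity", ascending=False).head())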

 

There is also a small number of machines reporting more serious policy statuses. The tree map below also uses policy status as the color response, with the business hierarchy as the grouping variable.

 

Policy Status (Maximum)

 

By using maximum as the aggregation, the presence of machines with serious violations quickly becomes apparent. The visualization shows that there is already at least one machine with a possible virus infection in the headquarters business unit - we later determined it to be a server in datacenter-2. Other regions also show the less serious but still concerning status levels 3 (serious policy deviation) and 4 (critical policy deviation).
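The tree-map aggregation can be sketched the same way: group by the business hierarchy and keep the worst (maximum) policy status per group, so that a single badly compromised machine surfaces its whole branch. As before, the column names (businessunit, facility, policystatus) are assumptions about the schema.

    import pandas as pd

    # Column and file names below are assumed placeholders, not the actual log schema.
    logs = pd.read_csv("bom_health_logs.csv", parse_dates=["healthtime"])
    snapshot = logs[logs["healthtime"] == pd.Timestamp("2012-02-02 14:00")]

    # Worst policy status per branch of the business hierarchy.
    worst = (
        snapshot.groupby(["businessunit", "facility"])["policystatus"]
        .max()
        .sort_values(ascending=False)
    )
    print(worst[worst >= 3])  # serious (3), critical (4), possible infection (5)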

 

MC 1.2  Use your visualization tools to look at how the network’s status changes over time. Highlight up to five potential anomalies in the network and provide a visualization of each. When did each anomaly begin and end? What might be an explanation of each anomaly?

 

As the situation awareness investigation has shown, there was already an infected machine in the network early on. Plotting the change in the number of logs per policy status over time shows that the number of machines in the more serious policy status categories ("Serious", "Critical" and "Infected") increased steadily over time.

 

Logs over time by status
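The trend chart above simply counts log records per policy status at each reporting time. A rough equivalent of that query, under the same assumed schema as the earlier sketches:

    import pandas as pd

    # Column and file names below are assumed placeholders, not the actual log schema.
    logs = pd.read_csv("bom_health_logs.csv", parse_dates=["healthtime"])

    # Number of log records per policy status at each reporting time.
    status_over_time = (
        logs.groupby(["healthtime", "policystatus"])
        .size()
        .unstack("policystatus", fill_value=0)
    )
    print(status_over_time.tail())  # one column per policy status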

 

The first infected machine was identified as a compute server in data center 2, followed by a teller workstation in branch-30 of region-26 and a file server in data center 1. A look at activity over time for a sample of infected machines suggests a pattern in how infected machines progress through all policy status stages, from "healthy" (status = 1) all the way to "infected" (status = 5). Overall it is clear that the rate of infection is accelerating. A forecast analysis indicates that the total number of infected machines will surpass the 10,000 mark by 7 pm on February 4th.

 

Infection forecast
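The forecast itself was produced with the tool's forecasting feature. A much cruder extrapolation, shown below only to make the reasoning concrete, fits exponential growth to the count of distinct infected machines over time and solves for when it crosses 10,000. The machine-identifier column (ipaddr) and the rest of the schema are assumptions, as in the earlier sketches.

    import numpy as np
    import pandas as pd

    # Column and file names below are assumed placeholders, not the actual log schema.
    logs = pd.read_csv("bom_health_logs.csv", parse_dates=["healthtime"])

    # Distinct machines reporting "infected" (policy status 5) at each time.
    infected = (
        logs[logs["policystatus"] == 5]
        .groupby("healthtime")["ipaddr"]
        .nunique()
    )

    # Crude exponential extrapolation: fit a line to log(count) vs. hours elapsed.
    hours = (infected.index - infected.index.min()) / pd.Timedelta(hours=1)
    slope, intercept = np.polyfit(hours, np.log(infected.values), 1)

    # Solve exp(intercept + slope * h) = 10,000 for h.
    h_cross = (np.log(10_000) - intercept) / slope
    print("10,000 infected machines around:",
          infected.index.min() + pd.Timedelta(hours=float(h_cross)))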

 


The second anomaly detected was the significant difference between the total number of logs generated at the first recorded log time (8:15 am on February 2nd) and at that same time on the next day. Approximately 50,000 machines seemed to be missing on the first day. The difference was concentrated in the "compute" and "multiple" subcategories of the "server" machine class.

 

 

We verified that the "new" logs were indeed coming from unique machines and not from machines generating multiple logs. Data brushing of the logs from the affected timestamps against coordinate-based frequency charts indicated that all of the extra logs seemed to originate from the same location: latitude 66.64, longitude -100.56. With this information we filtered the data even further and loaded the result into a table, adding business unit and facility. A quick scan showed that all of these logs originated from the same facility, data center 5. A new chart of frequency by location, filtered on the timestamps of interest, confirmed this hypothesis.
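The same verification can be expressed as a set difference between the machines reporting at 8:15 am on each day, followed by a breakdown of the "extra" machines by location and facility. The sketch below again uses hypothetical column names (ipaddr for the machine identifier, plus latitude, longitude, businessunit, facility, machineclass and machinefunction).

    import pandas as pd

    # Column and file names below are assumed placeholders, not the actual log schema.
    logs = pd.read_csv("bom_health_logs.csv", parse_dates=["healthtime"])

    day1 = logs[logs["healthtime"] == pd.Timestamp("2012-02-02 08:15")]
    day2 = logs[logs["healthtime"] == pd.Timestamp("2012-02-03 08:15")]

    # Machines that report on the second morning but were silent on the first.
    missing = set(day2["ipaddr"]) - set(day1["ipaddr"])
    extra = day2[day2["ipaddr"].isin(missing)]
    print(len(missing), "machines silent on the first morning")

    # Where do the extra logs come from?
    print(extra.groupby(["latitude", "longitude"]).size())
    print(extra.groupby(["businessunit", "facility"]).size())
    print(extra.groupby(["machineclass", "machinefunction"]).size())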

 

 

We believe the machines in data center 5 were turned off for scheduled maintenance, which explains why they were "silent" on the morning of February 2nd.

 


 

In region 25, the number of logs generated by ATM machines declines and later recovers during the night between February 2nd and February 3rd. In all other regions the frequency of logs generated by ATMs over time is stable, with changes limited to at most five machines. The decline starts on February 2nd around 12:00 pm and ends on February 3rd around 3:45 am.
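One way to isolate this anomaly outside the tool is to count ATM logs per region at each reporting time and flag the region/time combinations that fall well below that region's usual volume. The sketch below follows that idea; machinefunction, region and the "atm" value are assumed names, and the five-machine threshold mirrors the observation above.

    import pandas as pd

    # Column and file names below are assumed placeholders, not the actual log schema.
    logs = pd.read_csv("bom_health_logs.csv", parse_dates=["healthtime"])

    # ATM log volume per region at each reporting time.
    atm = logs[logs["machinefunction"] == "atm"]
    atm_counts = (
        atm.groupby(["healthtime", "region"])
        .size()
        .unstack("region", fill_value=0)
    )

    # Flag times where a region drops more than five machines below its median.
    baseline = atm_counts.median()
    dips = atm_counts[atm_counts.lt(baseline - 5)].dropna(how="all")
    print(dips)  # region 25 stands out between Feb 2 ~12:00 pm and Feb 3 ~3:45 am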

 

 

Region 25 (longitudes 47W - 52W) is situated in the first BankWorld time zone. This means that the local time for the observed anomaly was around 9:00 am on February 3rd; the situation went back to normal around 12:45 am of the same day. By comparing the logs at specific timestamps, grouped by facility, we were able to identify the specific facilities in region 25 affected by this anomaly.

 

 

As in the previous scenario, we believe these machines were likely turned off to undergo regular maintenance. We were not able to confirm this by examining the corresponding activity flag, but that could simply reflect the reduced amount of information available.

 


 

When looking at policy status, we found two regions (5 and 10) that start with all machines in the "moderate" state. Unlike the rest of the network, none of the machines in these two regions show up as "healthy" at any point. Therefore, the anomaly began on February 2 at 8:15 am and ended on February 4 at 8:00 am, spanning the entire observation period.

 

 

We compared the breakdown of policy status of these two regions with the remaining large regions, and found an interesting symmetry at the aggregated level.

 

 

This visualization shows that the total log message count for each region is consistent; only the breakdown differs. When we added up the number of "healthy" and "moderate" machines in any region other than regions 5 and 10, the result was roughly the number of "moderate" machines in regions 5 and 10. One possible explanation is that even when machines are "healthy", their status is reported as "moderate". There might be a software bug in how the policy status is calculated for machines in these two regions. Another possibility is a local network issue affecting all of these machines, such as a misconfigured regional domain name server.
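The arithmetic behind this explanation is easy to restate: for every region, compare the count of "healthy" plus "moderate" machines against the "moderate"-only counts of regions 5 and 10. A sketch under the same assumed schema (with a hypothetical region column):

    import pandas as pd

    # Column and file names below are assumed placeholders, not the actual log schema.
    logs = pd.read_csv("bom_health_logs.csv", parse_dates=["healthtime"])
    snapshot = logs[logs["healthtime"] == pd.Timestamp("2012-02-02 14:00")]

    # Per-region counts of "healthy" (1) and "moderate" (2) machines.
    counts = (
        snapshot[snapshot["policystatus"].isin([1, 2])]
        .groupby(["region", "policystatus"])
        .size()
        .unstack("policystatus", fill_value=0)
        .rename(columns={1: "healthy", 2: "moderate"})
    )

    # In regions 5 and 10 "healthy" is zero; elsewhere healthy + moderate
    # is roughly the "moderate" total reported by those two regions.
    counts["healthy_plus_moderate"] = counts["healthy"] + counts["moderate"]
    print(counts.sort_values("healthy").head(10))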

 


 

The distribution of the number of connections across machine functions is consistent, with one exception: when we controlled for the "teller" function, only machines in region 10 had maximum values above 50. In fact they go all the way to 100, twice the maximum for all other machines in the network.

 

 

A look at these values over time shows that they are restricted to a four-hour window on February 3rd, between 8:00 am and 12:00 pm BMT - a time when activity was supposed to be much lower. We found that during this period almost all branches assigned to this region had at least some teller machines reporting connections at this level, and that even the medians were above 60. There is a similar but less consistent period of extraordinary activity on the previous day.
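A simple filter makes this finding reproducible: restrict the logs to teller machines, compute the maximum and median number of connections per region and reporting time, and keep only the combinations that exceed the roughly 50-connection ceiling seen everywhere else. As before, numconnections, machinefunction and region are assumed column names.

    import pandas as pd

    # Column and file names below are assumed placeholders, not the actual log schema.
    logs = pd.read_csv("bom_health_logs.csv", parse_dates=["healthtime"])

    # Connection statistics for teller machines per region and time.
    tellers = logs[logs["machinefunction"] == "teller"]
    stats = (
        tellers.groupby(["region", "healthtime"])["numconnections"]
        .agg(["max", "median"])
    )

    # Region/time combinations above the ~50-connection ceiling.
    spikes = stats[stats["max"] > 50]
    print(spikes)  # expected: region 10, Feb 3 between 8:00 am and 12:00 pm BMT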

 

 

What could cause the number of connections on so many teller machines to increase so dramatically? Perhaps there was a bank run caused by bad economic news. If so, the higher number of connections would reflect an unusual number of bank customers going to the tellers to withdraw money from their accounts.